Underdamped Langevin MCMC with third order convergence
Scott, Maximilian, O'Kane, Dáire, Jelinčič, Andraž, Foster, James
In this paper, we propose a new numerical method for the underdamped Langevin diffusion (ULD) and present a non-asymptotic analysis of its sampling error in the 2-Wasserstein distance when the $d$-dimensional target distribution $p(x)\propto e^{-f(x)}$ is strongly log-concave and has varying degrees of smoothness. Precisely, under the assumptions that the gradient and Hessian of $f$ are Lipschitz continuous, our algorithm achieves a 2-Wasserstein error of $\varepsilon$ in $\mathcal{O}(\sqrt{d}/\varepsilon)$ and $\mathcal{O}(\sqrt{d}/\sqrt{\varepsilon})$ steps respectively. Therefore, our algorithm has complexity similar to that of other popular Langevin MCMC algorithms under matching assumptions. However, if we additionally assume that the third derivative of $f$ is Lipschitz continuous, then our algorithm achieves a 2-Wasserstein error of $\varepsilon$ in $\mathcal{O}(\sqrt{d}/\varepsilon^{\frac{1}{3}})$ steps. To the best of our knowledge, this is the first gradient-only method for ULD with third-order convergence. To support our theory, we perform Bayesian logistic regression across a range of real-world datasets, where our algorithm achieves competitive performance compared to an existing underdamped Langevin MCMC algorithm and the popular No-U-Turn Sampler (NUTS).
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > United Kingdom > England > Somerset > Bath (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > India (0.04)
- North America > United States > Illinois (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Data Science (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
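For orientation, the underdamped Langevin diffusion referred to in the abstract above is commonly written as $dx = v\,dt$, $dv = -\gamma v\,dt - u\nabla f(x)\,dt + \sqrt{2\gamma u}\,dW$. The sketch below is a plain Euler-Maruyama discretization of that SDE on a standard Gaussian target. It is a first-order baseline for illustration only, not the third-order method the paper proposes; the target, step size `h`, friction `gamma`, and `u` are all assumed values.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = ||x||^2 / 2, i.e. a standard Gaussian target (assumed example).
    return x

def uld_euler_maruyama(x0, n_steps, h=0.05, gamma=2.0, u=1.0, seed=0):
    """Simulate ULD with a basic Euler-Maruyama scheme (illustrative baseline)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = rng.standard_normal(x.size)
        # Position follows the current velocity.
        x_new = x + h * v
        # Velocity feels friction, the gradient force, and Brownian noise.
        v = v - h * gamma * v - h * u * grad_f(x) + np.sqrt(2.0 * gamma * u * h) * noise
        x = x_new
        samples[k] = x
    return samples

samples = uld_euler_maruyama(x0=np.ones(2), n_steps=20000)
print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var(axis=0))
```

For this target the sample mean should sit near 0 and the variance near 1, up to discretization bias and Monte Carlo error.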
Computational and Statistical Guarantees for Tensor-on-Tensor Regression with Tensor Train Decomposition
Recently, a tensor-on-tensor (ToT) regression model has been proposed to generalize tensor recovery, encompassing scenarios like scalar-on-tensor regression and tensor-on-vector regression. However, the exponential growth in tensor complexity poses challenges for storage and computation in ToT regression. To overcome this hurdle, tensor decompositions have been introduced, with the tensor train (TT)-based ToT model proving efficient in practice due to reduced memory requirements, enhanced computational efficiency, and decreased sampling complexity. Despite these practical benefits, a gap remains between theoretical analysis and real-world performance. In this paper, we study the theoretical and algorithmic aspects of the TT-based ToT regression model. Assuming the regression operator satisfies the restricted isometry property (RIP), we conduct an error analysis for the solution to a constrained least-squares optimization problem. This analysis includes an upper error bound and a minimax lower bound, revealing that these bounds depend polynomially on the order $N+M$. To efficiently find solutions meeting such error bounds, we propose two optimization algorithms: the iterative hard thresholding (IHT) algorithm (employing gradient descent with TT-singular value decomposition (TT-SVD)) and the factorization approach using the Riemannian gradient descent (RGD) algorithm. When RIP is satisfied, spectral initialization provides a suitable starting point, and we establish the linear convergence rate of both IHT and RGD.
- North America > United States > Ohio (0.04)
- Europe > Italy (0.04)
- Africa > Senegal > Kolda Region > Kolda (0.04)
- Research Report (0.63)
- Overview (0.45)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
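The TT-SVD step that the abstract above pairs with gradient descent in IHT can be sketched as a sequence of truncated SVDs. This is the standard textbook construction, not necessarily the authors' implementation; the uniform rank cap `max_rank` and the helper `tt_to_full` are assumptions made for the example.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose `tensor` into tensor-train cores via sequential truncated SVDs."""
    dims = tensor.shape
    cores = []
    rank = 1
    mat = tensor.reshape(rank * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        new_rank = min(max_rank, len(s))
        # Core k has shape (left rank, mode size, right rank).
        cores.append(u[:, :new_rank].reshape(rank, dims[k], new_rank))
        # Carry the remaining factor forward and fold in the next mode.
        mat = (s[:new_rank, None] * vt[:new_rank]).reshape(new_rank * dims[k + 1], -1)
        rank = new_rank
    cores.append(mat.reshape(rank, dims[-1], 1))
    return cores

def tt_to_full(cores):
    """Contract TT cores back into a full tensor (for checking the approximation)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 4, 5))
cores = tt_svd(T, max_rank=20)   # rank cap large enough for an exact decomposition
T_hat = tt_to_full(cores)
print("reconstruction error:", np.linalg.norm(T - T_hat))
```

With a smaller `max_rank`, the same routine returns a low-TT-rank approximation, which is the role it plays as the thresholding operator.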
Provable Multi-Party Reinforcement Learning with Diverse Human Feedback
Zhong, Huiying, Deng, Zhun, Su, Weijie J., Wu, Zhiwei Steven, Zhang, Linjun
Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work initiates the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
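To illustrate how the three social welfare functions named in the abstract above can disagree, here is a minimal sketch using their usual textbook definitions (sum, product, and lexicographic comparison of sorted utilities). The candidate names and per-party utilities are invented for the example and do not come from the paper.

```python
import math

def utilitarian(utilities):
    # Total utility across all parties.
    return sum(utilities)

def nash(utilities):
    # Product of utilities; assumes every utility is non-negative.
    return math.prod(utilities)

def leximin_key(utilities):
    # Sort ascending so the worst-off party is compared first,
    # then the second worst-off, and so on (tuples compare lexicographically).
    return tuple(sorted(utilities))

def best_policy(candidates, welfare):
    """Return the name of the candidate whose utilities maximize `welfare`."""
    return max(candidates, key=lambda name: welfare(candidates[name]))

# Hypothetical per-party utilities for three candidate policies.
candidates = {
    "A": [1.0, 0.1, 0.9],
    "B": [0.6, 0.6, 0.6],
    "C": [0.9, 0.5, 0.5],
}

for label, welfare in [("utilitarian", utilitarian), ("nash", nash), ("leximin", leximin_key)]:
    print(label, "->", best_policy(candidates, welfare))
```

Each criterion selects a different candidate here, which is the kind of trade-off between efficiency and fairness that the choice of welfare function controls.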
Stochastic Localization via Iterative Posterior Sampling
Grenioux, Louis, Noble, Maxence, Gabrié, Marylou, Durmus, Alain Oliviero
Building upon score-based learning, new interest in stochastic localization techniques has recently emerged. In these models, one seeks to noise a sample from the data distribution through a stochastic process, called the observation process, and progressively learns a denoiser associated with these dynamics. Apart from specific applications, the use of stochastic localization for the problem of sampling from an unnormalized target density has not been explored extensively. This work helps fill this gap. We consider a general stochastic localization framework and introduce an explicit class of observation processes, associated with flexible denoising schedules. We provide a complete methodology, Stochastic Localization via Iterative Posterior Sampling (SLIPS), to obtain approximate samples of these dynamics and, as a by-product, samples from the target distribution. Our scheme is based on a Markov chain Monte Carlo estimation of the denoiser and comes with detailed practical guidelines. We illustrate the benefits and applicability of SLIPS on several benchmarks, including Gaussian mixtures in increasing dimensions, Bayesian logistic regression, and a high-dimensional field system from statistical mechanics.
- North America > United States (0.14)
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)
Dr$^2$Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
Zhao, Chen, Liu, Shuming, Mangalam, Karttikeya, Qian, Guocheng, Zohra, Fatimah, Alghannam, Abdulmohsen, Malik, Jitendra, Ghanem, Bernard
Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly memory-intensive for tasks with high-resolution data, e.g., video understanding, small object detection, and point cloud analysis. In this paper, we propose Dynamic Reversible Dual-Residual Networks, or Dr$^2$Net, a novel family of network architectures that acts as a surrogate network to finetune a pretrained model with substantially reduced memory consumption. Dr$^2$Net contains two types of residual connections, one maintaining the residual structure in the pretrained models, and the other making the network reversible. Because the network is reversible, intermediate activations can be reconstructed from the output and are therefore cleared from memory during training. We place a coefficient on each of the two types of residual connection, and introduce a dynamic training strategy that seamlessly transitions the pretrained model to a reversible network with much higher numerical precision. We evaluate Dr$^2$Net on various pretrained models and various tasks, and show that it can reach comparable performance to conventional finetuning but with significantly less memory usage.
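The memory saving described above rests on being able to recompute a block's inputs from its outputs. The sketch below shows this with a classic RevNet-style additive coupling, which is a simpler stand-in and not the Dr$^2$Net dual-residual block with its two coefficients; the sub-functions `F` and `G` are arbitrary placeholders.

```python
import numpy as np

def F(x):
    # Placeholder sub-network; any deterministic function works.
    return np.tanh(x)

def G(x):
    # Second placeholder sub-network.
    return 0.5 * np.sin(x)

def forward(x1, x2):
    """Additive coupling: each half is updated using only the other half."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    """Recover the inputs from the outputs by undoing the updates in reverse order."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print("max reconstruction error:", max(np.abs(r1 - x1).max(), np.abs(r2 - x2).max()))
```

Because the inputs can be rebuilt this way during the backward pass, they do not need to be stored after the forward pass.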
On the Optimization and Generalization of Multi-head Attention
Deora, Puneesh, Ghaderi, Rouzbeh, Taheri, Hossein, Thrampoulidis, Christos
The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect the analysis can be extended to various data-model and architecture variations.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
- North America > Canada > British Columbia (0.04)
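For readers unfamiliar with the object being analyzed, a single-layer multi-head self-attention forward pass can be sketched as follows. This is the generic scaled dot-product form with per-head output projections summed over heads; the exact parameterization studied in the paper may differ, and all shapes and weights here are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """X: (T, d); Wq, Wk, Wv: (H, d, d_h); Wo: (H, d_h, d)."""
    Q = np.einsum("td,hde->hte", X, Wq)   # queries per head
    K = np.einsum("td,hde->hte", X, Wk)   # keys per head
    V = np.einsum("td,hde->hte", X, Wv)   # values per head
    scores = np.einsum("hte,hse->hts", Q, K) / np.sqrt(Wq.shape[-1])
    attn = softmax(scores, axis=-1)       # each row is a distribution over tokens
    heads = np.einsum("hts,hse->hte", attn, V)
    # Summing per-head projections is equivalent to concatenating heads and projecting.
    return np.einsum("hte,hed->td", heads, Wo), attn

rng = np.random.default_rng(0)
T_len, d, H, d_h = 5, 8, 2, 4
X = rng.standard_normal((T_len, d))
Wq, Wk, Wv = (rng.standard_normal((H, d, d_h)) for _ in range(3))
Wo = rng.standard_normal((H, d_h, d))
out, attn = multi_head_self_attention(X, Wq, Wk, Wv, Wo)
print(out.shape, attn.shape)
```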
Fast Interactive Search with a Scale-Free Comparison Oracle
Chumbalov, Daniyar, Klein, Lars, Maystre, Lucas, Grossglauser, Matthias
A comparison-based search algorithm lets a user find a target item $t$ in a database by answering queries of the form, "Which of items $i$ and $j$ is closer to $t$?" Instead of formulating an explicit query (such as one or several keywords), the user navigates towards the target via a sequence of such (typically noisy) queries. We propose a scale-free probabilistic oracle model called $\gamma$-CKL for such similarity triplets $(i,j;t)$, which generalizes the CKL triplet model proposed in the literature. The generalization affords independent control over the discriminating power of the oracle and the dimension of the feature space containing the items. We develop a search algorithm with provably exponential rate of convergence under the $\gamma$-CKL oracle, thanks to a backtracking strategy that deals with the unavoidable errors in updating the belief region around the target. We evaluate the performance of the algorithm both over the posited oracle and over several real-world triplet datasets. We also report on a comprehensive user study, where human subjects navigate a database of face portraits.
- Europe > Switzerland > Vaud > Lausanne (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
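The query format described in the abstract above can be made concrete with a deliberately simplified sketch: a noiseless oracle and a search that discards every item lying closer to the losing item of each comparison. This is not the $\gamma$-CKL model or the paper's backtracking algorithm, both of which handle noisy answers; the item embedding and the pair-selection rule are assumptions for illustration.

```python
import numpy as np

def oracle(items, i, j, t):
    """Noiseless comparison: return (winner, loser), where winner is closer to target t."""
    di = np.linalg.norm(items[i] - items[t])
    dj = np.linalg.norm(items[j] - items[t])
    return (i, j) if di <= dj else (j, i)

def comparison_search(items, t):
    """Find t using only oracle answers; t itself is never inspected directly."""
    candidates = list(range(len(items)))
    n_queries = 0
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]
        winner, loser = oracle(items, i, j, t)
        n_queries += 1
        # Keep only items at least as close to the winner as to the loser.
        # The target always survives this cut, and the loser never does.
        candidates = [c for c in candidates
                      if np.linalg.norm(items[c] - items[winner])
                      <= np.linalg.norm(items[c] - items[loser])]
    return candidates[0], n_queries

rng = np.random.default_rng(0)
items = rng.standard_normal((50, 3))   # hypothetical feature vectors
found, n_queries = comparison_search(items, t=17)
print("found item", found, "after", n_queries, "queries")
```

With a noisy oracle, a single wrong answer could eliminate the target for good, which is why the paper's algorithm needs a backtracking strategy.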
Data-Driven Response Regime Exploration and Identification for Dynamical Systems
Data-Driven Response Regime Exploration and Identification (DR$^2$EI) is a novel and fully data-driven method for identifying and classifying response regimes of a dynamical system without requiring human intervention. This approach is a valuable tool for exploring and discovering response regimes in complex dynamical systems, especially when the governing equations and the number of response regimes are unknown, and the system is expensive to sample. Additionally, the method is useful for order reduction, as it can be used to identify the most dominant response regimes of a given dynamical system. DR$^2$EI utilizes unsupervised learning algorithms to transform the system's response into an embedding space that facilitates regime classification. An active sequential sampling approach based on Gaussian Process Regression (GPR) is used to efficiently sample the parameter space, quantify uncertainty, and provide optimal trade-offs between exploration and exploitation. The performance of the DR$^2$EI method was evaluated by analyzing three established dynamical systems: the mathematical pendulum, the Lorenz system, and the Duffing oscillator. The method was shown to effectively identify a variety of response regimes with both similar and distinct topological features and frequency content, demonstrating its versatility in capturing a wide range of behaviors. While it may not be possible to guarantee that all possible regimes will be identified, the method provides an automated and efficient means for exploring the parameter space of a dynamical system and identifying its underlying "sufficiently dominant" response regimes without prior knowledge of the system's equations or behavior.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Asia > Middle East > Israel > Haifa District > Haifa (0.04)
- Information Technology > Scientific Computing (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
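The GPR-based sequential sampling idea in the abstract above can be sketched in one dimension: fit a Gaussian process to the points sampled so far and pick the next point where the posterior variance is largest. This is a pure-exploration rule with an assumed RBF kernel and length scale, not the DR$^2$EI acquisition strategy, which also balances exploitation.

```python
import numpy as np

def rbf(a, b, length=0.3):
    # Squared-exponential kernel with unit prior variance (assumed hyperparameters).
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_query."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_query, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

def next_sample(x_train, y_train, x_grid):
    """Choose the grid point with the largest posterior variance."""
    _, var = gp_posterior(x_train, y_train, x_grid)
    return x_grid[np.argmax(var)]

x_train = np.array([0.0, 0.2, 0.4])
y_train = np.sin(2 * np.pi * x_train)      # stand-in for an expensive system response
x_grid = np.linspace(0.0, 1.0, 101)
x_next = next_sample(x_train, y_train, x_grid)
print("next point to sample:", x_next)
```

Since all observed points lie in the left part of the interval, the rule proposes a point on the unexplored right side.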